Temperature data option calculation

Author: Pietro Rota

Modified: May 11, 2025

1 Weather Derivatives Temperature Options

Full dependencies list
[1] "R version 4.4.3 (2025-02-28 ucrt)"
[1] "x86_64-w64-mingw32/x64"

Functions loaded from main

check_acc, desc_df, find_outliers, quickplot, remove_outliers, RSS, show_df, smart_round

Packages required to run this file

Show code
required_packages(file)
caret, dplyr, DT, e1071, fBasics, forecast, ggplot2, gt, gtExtras, leaflet, lubridate, MASS, nlme, NMOF, PerformanceAnalytics, plotly, quantmod, reshape2, rugarch, splines, stats4, tidyr, timeDate, timeSeries, TTR, xts, zoo

Functions defined in this document

apply_convolution, sin_component, create_spline_plot
Show code
cat("time of creation", "\n")
time of creation 
Show code
print(file.info(file)$ctime)
[1] "2024-08-27 15:52:39 GMT"
Show code
cat("LAST MODIFICATION", "\n")
LAST MODIFICATION 
Show code
print(file.info(file)$mtime)
[1] "2025-05-06 15:27:18 GMT"
Show code
cat("Last Access", "\n")
Last Access 
Show code
print(file.info(file)$atime)
[1] "2025-05-06 15:27:18 GMT"

References:

  1. Weather Futures and Options (CME Group Inc, 2021) [LINK]

  2. Managing Climate Risk with CME Group Weather Futures and Options (Dominic Sutton-Vermeule, 20-Jan 2021) [LINK]

  3. MANAGING CLIMATE RISK IN THE U.S. FINANCIAL SYSTEM, ISBN: 978-0-578-74841-2 (2020) [LINK]

2 Climate data download

Much of the weather market is dominated by temperature derivatives, which aim to protect the holder from unexpected temperature prints. The underlying used to trade these contracts is either Heating Degree Days (HDD) or Cooling Degree Days (CDD). For a specific day \(n\), these are defined as:

\(HDD_n = \max(T_{ref} - T_n, 0)\)

\(CDD_n = \max(T_n - T_{ref}, 0)\)

The most popular temperature instruments traded are options, whose payoff depends on the cumulative sum of these daily indices over a longer period, usually an entire season:

\(HDD = \sum_{n=1}^{N} HDD_n \qquad CDD = \sum_{n=1}^{N} CDD_n\)

Calls protect the holder from extreme conditions, while puts protect against unusually mild ones:

Option type   Protection against    Exercise when   Payout                       Example buyer
HDD call      Overly cold winters   HDD > K         \(\alpha \cdot (HDD - K)\)   Farmers
HDD put       Overly warm winters   HDD < K         \(\alpha \cdot (K - HDD)\)   Ski resorts
CDD call      Overly hot summers    CDD > K         \(\alpha \cdot (CDD - K)\)   Utilities
CDD put       Overly cold summers   CDD < K         \(\alpha \cdot (K - CDD)\)   Beaches
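
To make the payoff concrete, here is a minimal R sketch (my own illustration, not part of the analysis that follows), assuming a reference temperature of 18 °C and a notional tick size \(\alpha\):

daily_hdd <- function(t_avg, t_ref = 18) pmax(t_ref - t_avg, 0)  # HDD_n = max(T_ref - T_n, 0)
daily_cdd <- function(t_avg, t_ref = 18) pmax(t_avg - t_ref, 0)  # CDD_n = max(T_n - T_ref, 0)

# Payoff of a seasonal HDD call: alpha * max(sum of daily HDDs - strike, 0)
hdd_call_payoff <- function(t_avg, strike, alpha = 20, t_ref = 18) {
  season_hdd <- sum(daily_hdd(t_avg, t_ref))  # cumulative HDD index over the season
  alpha * max(season_hdd - strike, 0)         # the call pays only above the strike
}

# Example on a synthetic 150-day winter season (illustrative values only)
set.seed(1)
winter_temps <- rnorm(150, mean = 8, sd = 4)
hdd_call_payoff(winter_temps, strike = 1400)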

NASA Prediction Of Worldwide Energy Resources (POWER) | Data Access Viewer (DAV) [LINK]

This dataset comes from the POWER Agroclimatology community collection; recording the exact source and parameters helps keep the climate analysis reproducible.

This study utilizes temperature data obtained from the MERRA-2 (Modern-Era Retrospective analysis for Research and Applications, Version 2) dataset, which provides detailed meteorological measurements at a spatial resolution of 0.5 x 0.625 degrees latitude/longitude.

The specific region analyzed has an average elevation of 288.19 meters and is located at latitude 45.5330, longitude 9.1911.

The dataset uses a value of -999 to denote missing data, i.e. when a parameter cannot be computed or falls outside the range of the available source data; I removed these flags and replaced them with the mean of the overall dataset.
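
As a rough sketch of that replacement step (the helper name here is hypothetical, not one of the functions loaded from main):

replace_missing <- function(x, flag = -999) {
  x[x == flag] <- NA                     # treat the -999 flags as missing
  x[is.na(x)] <- mean(x, na.rm = TRUE)   # fill the gaps with the overall mean
  x
}

# e.g. T2M_MAX_clean <- replace_missing(ORIGINAL_DATASET$T2M_MAX)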

Parameter(s):

  • T2M_MAX = Temperature at 2 Meters Max for the day (C)

  • T2M_MIN = Temperature at 2 Meters Min for the day (C)

  • T2M_AVG = \(\frac{T_{MIN}+T_{MAX}}{2}\) Temperature at 2 Meters Average for the day (C)

Temperature at 2 meters refers to the air temperature measured 2 meters (approximately 6.5 feet) above the ground. This is the standard height for temperature measurements used in meteorology because it minimizes the effects of ground heating or cooling, providing a more accurate reflection of the ambient air temperature experienced by humans and relevant to various economic activities.

Why? Temperature at 2 meters is commonly used as the basis for weather derivatives. This is because:

  • Standardization: Temperature at 2 meters is a standardized and widely available data point, making it reliable for use in financial contracts.

  • Relevance: Many weather-related risks, such as heating degree days (HDD) and cooling degree days (CDD), are calculated based on temperatures at this height. These metrics are commonly used in weather derivatives to hedge against risks related to heating and cooling needs.

  • Accuracy: Because it represents the temperature most relevant to human activities and energy consumption, it is directly applicable for creating and settling weather derivatives that are based on temperature-related indices.

Show code
ORIGINAL_DATASET <- read.csv("DISNEY DATA.csv", skip = 10)  # skip the metadata header rows

DATASET <- ORIGINAL_DATASET %>% 
  # flag outliers as NA, then fill the gaps by linear interpolation
  mutate(T_MAX = remove_outliers(T2M_MAX, fill = "NA") %>% na.approx) %>% 
  mutate(T_MIN = remove_outliers(T2M_MIN, fill = "NA") %>% na.approx) %>% 
  # build a calendar date from YEAR and the day of year (DOY)
  mutate(DAY = as.Date(ORIGINAL_DATASET$DOY - 1, origin = paste0(ORIGINAL_DATASET$YEAR, "-01-01"))) %>%
  mutate(Month = month(DAY)) %>% 
  dplyr::select(DAY, YEAR, Month, DOY, T_MAX, T_MIN)


# daily average temperature as the midpoint of max and min
DATASET$T_AVG <- apply(DATASET[c("T_MAX", "T_MIN")], 1, mean)

ORIGINAL_DATASET[ifelse(find_outliers(ORIGINAL_DATASET$T2M_MAX)==0,FALSE,TRUE),] %>% 
  gt() %>% 
  opt_stylize(5) %>% 
  tab_header(title = "Days where the sensor malfunctioned", subtitle = "identified by remove outliers")
Days where the sensor malfunctioned
identified by remove outliers
YEAR DOY T2M_MAX T2M_MIN
1981 12 8.16 -2.85
1981 13 10.84 -5.76
1981 17 12.26 1.17
1981 34 11.99 0.42
1981 344 10.79 1.82
1981 345 11.31 -1.37
1981 353 6.68 -3.02
1981 354 11.78 -1.04
1982 10 11.52 3.68
1982 11 5.57 -3.92
1982 15 11.81 -3.58
1983 12 12.42 3.11
1983 35 11.73 4.37
1983 70 12.16 2.57
1983 359 2.64 -6.67
1983 360 5.66 -6.28
1983 364 8.38 2.49
1984 37 10.94 1.64
1984 38 11.59 -1.49
1984 60 11.50 3.08
1985 21 3.62 -7.77
1985 22 9.55 -6.97
1985 26 12.23 2.51
1985 355 10.25 2.43
1985 360 5.98 -5.62
1986 27 12.08 -3.50
1986 28 6.62 -6.49
1986 44 9.31 1.81
1986 60 11.12 3.07
1987 2 12.09 5.07
1987 23 10.74 2.36
1987 27 10.34 1.31
1987 40 12.22 3.41
1988 23 11.91 4.52
1988 27 11.38 -0.66
1988 37 10.72 2.61
1988 352 11.87 2.13
1988 353 10.96 -0.87
1989 54 11.93 1.22
1989 55 11.02 -2.31
1989 356 11.01 4.48
1989 357 3.96 -5.65
1989 358 2.93 -7.06
1989 359 6.91 -6.19
1989 360 12.48 1.21
1991 47 10.57 -2.11
1993 73 11.06 -1.13
1993 74 12.26 -1.26
1995 21 12.53 5.36
1995 24 11.50 2.22
1995 37 11.13 2.31
1995 39 12.54 -0.05
1995 40 11.85 -5.06
1995 355 12.46 3.34
1995 358 10.11 -0.18
1995 359 11.09 -2.34
1995 361 11.66 0.39
1995 362 10.75 2.38
1996 4 12.45 2.40
1996 8 5.93 -3.15
1996 9 10.86 -4.39
1996 13 12.39 2.03
1996 35 7.82 -2.01
1996 36 10.83 -5.77
1996 48 11.07 -1.63
1996 69 10.67 -0.36
1996 355 8.84 -0.84
1997 17 12.17 2.55
1997 18 9.84 -0.87
1997 349 10.82 5.33
1997 362 11.31 3.92
1998 71 12.11 2.55
1999 5 9.92 -0.66
2000 26 9.80 0.10
2000 27 12.14 -2.22
2000 355 8.66 -3.37
2000 365 11.31 0.83
2000 366 9.39 -3.87
2001 2 11.51 1.99
2001 4 11.98 -1.01
2002 3 10.54 2.66
2002 4 9.83 -1.86
2002 8 11.04 -0.50
2003 4 12.22 1.24
2003 18 8.54 0.27
2003 24 7.61 -4.67
2004 355 9.85 1.58
2006 7 12.01 1.96
2006 43 10.89 1.85
2006 44 11.92 -0.54
2007 29 12.44 2.16
2008 2 9.18 -0.03
2008 3 12.09 -2.64
2009 21 9.70 -1.86
2009 35 11.22 -0.06
2009 36 10.40 -3.58
2010 3 5.91 0.77
2010 4 10.24 -2.32
2010 5 8.32 -1.74
2010 6 8.99 -4.08
2010 9 3.75 -3.87
2010 10 6.38 -3.86
2010 11 10.25 -4.76
2010 42 12.41 0.45
2010 44 11.09 2.17
2010 341 12.04 -0.16
2010 347 9.68 0.52
2010 348 9.33 -3.88
2010 361 9.98 -1.86
2011 13 12.19 -2.55
2012 3 9.31 0.72
2012 43 12.04 0.37
2014 7 8.78 0.53
2015 50 9.61 0.02
2015 51 12.29 -3.24
2016 19 11.64 2.67
2018 3 12.50 0.93
2018 4 11.36 -1.89
2018 5 11.25 -2.00
2018 18 11.33 -3.87
2020 360 11.69 3.31
2020 361 11.32 -0.01
2022 29 10.83 1.53
2022 358 7.49 -2.49
2022 359 7.55 -2.74
2022 360 10.93 1.28
2023 14 12.20 3.14
2025 22 10.85 4.49
2025 23 10.72 5.11
Show code
datatable(smart_round(DATASET, digits = 3))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `across(where(is.numeric), round, digits = digits)`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.

  # Previously
  across(a:b, mean, na.rm = TRUE)

  # Now
  across(a:b, \(x) mean(x, na.rm = TRUE))
Warning in instance$preRenderHook(instance): It seems your data is too big for
client-side DataTables. You may consider server-side processing:
https://rstudio.github.io/DT/server.html

2.1 Initial data visualization

Show code
cleandataset <- DATASET %>% 
  select(T_MAX, T_MIN, T_AVG) %>% 
  xts(order.by = DATASET$DAY)

desc_df(cleandataset)
n mean sd median trimmed min max range skew kurtosis se %NA Q.1% Q.25% Q.75% Q.99%
T_MAX 16199 27.5756 4.8469 28.49 27.9438 12.590 40.30 27.710 -0.6797 0.1913 0.0381 0 14.3494 24.83 30.950 36.6404
T_MIN 16199 17.5620 6.1932 18.88 18.2900 -4.080 27.31 31.390 -0.8820 0.0307 0.0487 0 0.7700 13.85 22.870 25.3602
T_AVG 16199 22.5688 5.3084 23.90 23.1589 4.668 33.17 28.502 -0.8842 0.0769 0.0417 0 8.2897 19.50 26.745 30.2354
Show code
cleandataset %>% 
  ggplot(aes(x=index(.)))+
  geom_line(aes(y=T_MAX, color = "T_MAX"))+
  geom_line(aes(y=T_MIN, color = "T_MIN"))+
  geom_line(aes(y=T_AVG, color = "T_AVG"))+
  labs(title = "Last 43 years of recorded data", y="Temperature", x=NULL)+
  scale_color_manual(name = "Temps", 
                     values = c(T_MAX = "indianred3",T_MIN = "lightblue",T_AVG = "lightgreen"))

However, this is not that useful, so let me zoom in on the most recent part of the database: the last ten years (approximately 2015 onward).

Show code
NDAYS <- nrow(cleandataset)
lookback <- 365*10

{tail(cleandataset, lookback) %>% 
  as.data.frame() %>% 
  mutate(year = index(tail(cleandataset, lookback))) %>% 
  ggplot(aes(x = year))+
  geom_line(aes(y=T_MAX, color = "T_MAX"))+
  geom_line(aes(y=T_MIN, color = "T_MIN"))+
  geom_line(aes(y=T_AVG, color = "T_AVG"))+
  labs(title = "Last 10 years of recorded data", y="Temperature", x=NULL)+
  scale_color_manual(name = "Temps", 
                     values = c(T_MAX = "indianred3",T_MIN = "lightblue",T_AVG = "lightgreen"))} %>% 
  ggplotly() 
Show code
# library(gganimate)
# animplot1 <- plot1 +
#   labs(subtitle = "year: {frame_time}") +
#   transition_reveal(year) + 
#   view_follow(fixed_y = TRUE)
# 
# animate(animplot1, nframes = 20, renderer = gifski_renderer(), fps = 30, duration = 10, end_pause = 60, res = 100)

2.2 Seasonal analysis

Now I can look at the distribution of the data. First I need to split the observations into summer and winter periods, using the month of the year so that:

Winter ~ October-March

Summer ~ April-September

Show code
 DATASET_seas <- DATASET %>%
  group_by(Month < 4|Month>9) %>%
  rename("Season"="Month < 4 | Month > 9") %>%
  mutate(Season = ifelse(Season, "Winter", "Summer"))
  
DATASET_seas[-1] %>% 
  xts(order.by = DATASET$DAY)%>% 
  show_df() %>% 
  gt() %>%
  tab_header(title = "Temperature Overview") %>%
  opt_stylize(style = 5, add_row_striping = TRUE) %>%
  cols_align(align = "center") %>%
  sub_missing(columns = everything(), missing_text = "⋮")
Temperature Overview
DATE YEAR Month DOY T_MAX T_MIN T_AVG Season
1981-01-01 1981 1 1 18.96000 3.7600000 11.360000 Winter
1981-01-02 1981 1 2 14.78000 4.7100000 9.745000 Winter
1981-01-03 1981 1 3 17.66000 1.0000000 9.330000 Winter
1981-01-04 1981 1 4 19.79000 2.9600000 11.375000 Winter
1981-01-05 1981 1 5 14.88000 4.4500000 9.665000 Winter
2025-05-04 2025 5 124 31.01000 21.9100000 26.460000 Summer
2025-05-05 2025 5 125 32.31000 20.0000000 26.155000 Summer
2025-05-06 2025 5 126 34.18000 22.2700000 28.225000 Summer
2025-05-07 2025 5 127 34.48000 23.2600000 28.870000 Summer
2025-05-08 2025 5 128 33.77000 22.3900000 28.080000 Summer
Show code
winter_dataset <- DATASET_seas[DATASET_seas$Season=="Winter",]
summer_dataset <- DATASET_seas[DATASET_seas$Season=="Summer",]

cat(paste0("The difference in number of rows is approximately: ", round(nrow(summer_dataset) / nrow(winter_dataset) - 1, 4)*100,"%", ", or ", round((nrow(summer_dataset) / nrow(winter_dataset) - 1)*NDAYS, 1), " days"))
The difference in number of rows is approximately: -0.23%, or -38 days
Show code
grid.arrange(ncol=2,
ggplot(winter_dataset)+
  geom_histogram(aes(x = T_MAX, fill = "T_MAX"), alpha=0.8, bins = 80)+
  geom_histogram(aes(x = T_MIN, fill = "T_MIN"), alpha=0.8, bins = 80)+
  geom_histogram(aes(x = T_AVG, fill = "T_AVG"), alpha=0.8, bins = 80)+
  labs(title = "Distribution charts of winter",x=NULL)+
  scale_fill_manual(name = NULL, 
                     values = c(T_MAX = "indianred3",T_MIN = "lightblue",T_AVG = "lightgreen"))+
  theme(legend.position = "bottom")

,

ggplot(summer_dataset)+
  geom_histogram(aes(x = T_MAX, fill = "T_MAX"), alpha=0.8, bins = 80)+
  geom_histogram(aes(x = T_MIN, fill = "T_MIN"), alpha=0.8, bins = 80)+
  geom_histogram(aes(x = T_AVG, fill = "T_AVG"), alpha=0.8, bins = 80)+
  labs(title = "Distribution charts of summer",x=NULL)+
  scale_fill_manual(name = NULL, 
                    values = c(T_MAX = "indianred3",T_MIN = "lightblue",T_AVG = "lightgreen"))+
  theme(legend.position = "bottom")

)

With this choice of seasons I am still able to keep the data well balanced, and it is clear that the two periods have distinct temperature distributions.

Lastly, I plot the distribution of the daily average temperature for winter and summer separately.

Show code
ggplot()+
  geom_histogram(data = winter_dataset, aes(x = T_AVG, fill = "Winter"), alpha=0.8, bins = 80)+
  geom_histogram(data = summer_dataset, aes(x = T_AVG, fill = "Summer"), alpha=0.8, bins = 80)+
  labs(title = "Distribution charts of the 2 averages",x=NULL)+
    scale_fill_manual(name = NULL, 
                    values = c(Winter="steelblue", Summer="orange"))+
  theme(legend.position = "bottom")

To understand the extremes in the dataset, I also looked at which months contain the hottest and coldest days on record.

Show code
DATASET_seas %>% 
  group_by(Month) %>% 
  summarise(T_min_min = min(T_MIN),
            T_min_max = max(T_MIN),
            T_max_min = min(T_MAX),
            T_max_max = max(T_MAX)) %>% 
  round(3) %>%
  mutate(Month = month.abb) %>% 
  gt() %>% 
  opt_stylize(5) %>% 
  tab_header("Data exploration", "what is the highest and lowest in my database") %>% 
  cols_label(T_min_min = "Min",
             T_min_max = "Max",
             T_max_min = "Min",
             T_max_max = "Max") %>% 
  tab_spanner(label = "T_MIN", columns = 2:3) %>%  
  tab_spanner(label = "T_MAX", columns = 4:5)
Data exploration
what is the highest and lowest in my database
Month
T_MIN
T_MAX
Min Max Min Max
Jan -4.08 20.90 12.59 29.78
Feb -3.58 20.24 12.62 32.49
Mar -1.26 22.43 13.21 35.29
Apr 1.63 23.82 14.38 36.70
May 9.48 25.50 21.34 39.12
Jun 13.06 26.49 24.94 40.30
Jul 19.55 27.13 26.31 38.58
Aug 19.61 27.31 25.72 37.71
Sep 13.40 26.03 23.38 35.13
Oct 2.68 25.56 15.78 34.12
Nov 1.66 24.50 12.88 32.13
Dec -3.90 22.12 12.61 30.43

3 Seasonal decomposition

The first step in pricing temperature options is to decompose the time series into distinct components that each represent different underlying patterns. This process is essential for understanding the structure of the data and for developing predictive models. In basic terms the series can be expressed as the sum of three main components: \(y_t = T_t + S_t + e_t\). Here, \(T_t\) is the trend component that captures the long-term movement, \(S_t\) represents the seasonal variations (periodic patterns), and \(e_t\) refers to the residuals or noise in the data.

which in our case becomes \[\overline{T_t} = T_{trend} + T_{seasonal} + Residuals\] The approach was to develop each component independently.

The trend component can be as simple as a straight line or a moving average:

\[ T_{trend} = a + b \cdot t \]

The seasonal component captures the cyclical nature of temperature changes throughout the year, such as the annual temperature cycle:

\[ T_{seasonal} = \alpha \cdot \sin(\omega \cdot t + \theta) \]

  • \(\alpha\) represents the amplitude

  • \(\omega\) is the angular frequency (related to the period of the cycle)

  • \(\theta\) is the phase shift

And lastly the residuals are simply: \[Residuals = T_t - (T_{trend} + T_{seasonal})\]

Recent research has suggested that partial Fourier decomposition can be effective in modeling temperature data, particularly by omitting the cosine component. This approach simplifies the seasonal model while still accurately representing the cyclical nature of temperature changes. The sinusoidal term is often sufficient to describe seasonal temperature variations, making the model both interpretable and computationally efficient.
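
The reason the cosine term can be dropped without losing the single annual harmonic is the standard identity (added here as a reminder, not from the source):

\[ a \cdot \sin(\omega t) + b \cdot \cos(\omega t) = \alpha \cdot \sin(\omega t + \theta), \qquad \alpha = \sqrt{a^2 + b^2}, \quad \theta = \operatorname{atan2}(b, a) \]

so a single sine with a fitted amplitude and phase shift already absorbs any first-order sine-cosine pair.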

Before this, I applied a denoising step using a Gaussian convolution filter; reducing the noise also enhances the effectiveness of the Fourier decomposition, since the dominant frequencies or cycles in the temperature data can be identified more accurately.

  • Gaussian Convolution Filter:

    • A convolution is a mathematical operation used to blend or smooth data. It involves applying a kernel (a small array of weights) across the data to compute a weighted average. The Gaussian kernel is a specific type of convolution filter that assigns weights based on the Gaussian (normal) distribution, which is a bell-shaped curve.

    • The Gaussian kernel is widely used for smoothing because it gives more weight to the central points (closer to the middle) while gradually reducing the weights for points further away. This is mathematically expressed as:

\[ G(x)=\frac{1}{\sqrt{2\pi\sigma^2}}\cdot e^{-\frac{x^2}{2\sigma^2}} \]

  • Moving Window: The Gaussian kernel slides across each point in the time series. For each position, it computes a weighted sum where the weights are defined by the Gaussian kernel.

  • Local Averaging: The computed value represents a smoothed average of the points within the window’s range. Points closer to the center of the window contribute more heavily to the average than those farther away, as per the Gaussian weights.

  • Symmetric Filtering: using 2 sides makes this operation symmetric, applying the kernel both forward and backward across the data, ensuring the filtering effect is centered.

  • Conclusion and data inspection: in visual terms the denoised data should look smoother with slightly lower peaks
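
In code this is just a centered weighted moving average; a minimal sketch (assuming a 7-point window with unit standard deviation, and with the weights normalized to sum to one) could look like:

gauss_smooth <- function(x, half_width = 3, sigma = 1) {
  w <- dnorm(seq(-half_width, half_width), sd = sigma)  # bell-shaped weights
  w <- w / sum(w)                                       # normalize so the weights sum to 1
  stats::filter(x, w, sides = 2)                        # sides = 2 -> symmetric (centered) filter
}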

Show code
temps <- DATASET
apply_convolution <- function(x, kernel) {
  # Use filter() from stats package to apply convolution
  filtered <- stats::filter(x, kernel, sides = 2)  # Use sides = 2 for symmetric filter
  return(filtered)
}

# DENOISED - Gaussian convolution with a 7-day window (weights dnorm(-3:3))
kernel <- dnorm(-3:3)
data.frame("Gaussian_Kernel" = round(kernel, 10))
Gaussian_Kernel
0.0044318
0.0539910
0.2419707
0.3989423
0.2419707
0.0539910
0.0044318
Show code
temps$Denoised <- apply_convolution(temps$T_AVG, kernel)
temps$Denoised <- na.fill(temps$Denoised, mean(temps$Denoised, na.rm = TRUE))

temps$Trend <- SMA(temps$Denoised, n = lookback)

library(nlme, quietly = T, warn.conflicts = F)

# Define the model
sin_component <- function(t, a, b, alpha, theta) {
  omega <- 2 * pi / 365.25
  a + b * t + alpha * sin(omega * t + theta)
}
omega <- 2 * pi / 365.25

# Fit model using non linear squares
temps$NUM_DAY <- 1:nrow(temps)

fit <- nls(Denoised ~ sin_component(NUM_DAY, a, b, alpha, theta),
           data = temps,
           start = list(a = 1, b = 0, alpha = 1, theta = 0))


# Get coefficients and confidence intervals for the model
params <- coef(fit)
confint_fit <- confint(fit)
Waiting for profiling to be done...
Show code
temps$SEAS <- params["alpha"] * sin(omega * temps$NUM_DAY + params["theta"])
temps$TREND <- params["a"] + params["b"] * temps$NUM_DAY
temps$BAR <- temps$TREND + temps$SEAS
temps$RESID <-  temps$T_AVG - temps$TREND - temps$SEAS

Just to be certain, I checked whether the fitted values and residuals returned by nls match my own \(\overline T\) and \(Residuals\), by computing the percentage of identical values as the number of rounding decimals increases. The logic is that if two data sets are essentially alike (like the results of two different implementations of the same model), this shows down to how many decimal places they agree.
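
check_acc() is one of the helpers loaded from main; purely to illustrate the idea (this is a hypothetical sketch, not the actual helper), the comparison could be written as:

rounding_agreement <- function(a, b, max_digits = 15) {
  # share of identical values when both series are rounded to 1..max_digits decimals
  sapply(1:max_digits, function(d) mean(round(a, d) == round(b, d), na.rm = TRUE))
}

# e.g. rounding_agreement(temps$BAR, fitted(fit))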

Show code
cat("T_BAR VS fitted from the non linear squares \n")
T_BAR VS fitted from the non linear squares 
Show code
check_acc(temps$BAR, fitted(fit),15)
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.

Show code
cat("T_BAR VS fitted from the non linear squares \n")
T_BAR VS fitted from the non linear squares 
Show code
check_acc(temps$RESID, residuals(fit),15)
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.

3.1 Model performance

To assess the model performance I look at the parameter values and their confidence intervals, together with the residual sum of squares (RSS) and the mean absolute error (MAE):

  • The RSS function calculates the Residual Sum of Squares, which is a measure of the discrepancy between the observed data and the model’s predictions. A lower RSS indicates a better fit of the model to the data, meaning that the model’s predictions are closer to the actual observed values.

  • The MAE function is the average of the absolute differences between the observed and predicted values. It gives an idea of how far, on average, the predictions are from the actual values, without considering the direction of the error. A lower MAE indicates that the model’s predictions are generally close to the observed data.

a :  21.801  CI ~normally [ 21.718 , 21.883 ]
b :  0  CI ~normally [ 0 , 0 ]
alpha :  -6.034  CI ~normally [ -6.092 , -5.975 ]
theta :  -5.008  CI ~normally [ -5.018 , -4.998 ]
  RSS model sine curve: 157805.5 
  MAE model fit: 2.35 
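
For reference, the two metrics reported above follow the standard formulas (RSS() itself is loaded from main; this is just a sketch of the definitions):

rss_metric <- function(obs, pred) sum((obs - pred)^2, na.rm = TRUE)    # residual sum of squares
mae_metric <- function(obs, pred) mean(abs(obs - pred), na.rm = TRUE)  # mean absolute error

# e.g. rss_metric(temps$Denoised, fitted(fit)); mae_metric(temps$Denoised, fitted(fit))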

3.2 Visualization of results

Show code
# plot denoised 
ggplot(tail(temps, lookback), aes(x=DAY))+
  geom_point(aes(y = T_AVG, color = "Average"), size = 1)+
  geom_point(aes(y = Denoised, color = "Denoised"), size = 1)+
  scale_color_manual(name=NULL, values = c(Average = "royalblue", Denoised = "#ffcc00"))+
  labs(title = "Average temperature", x = "Date", y = "Temperature", 
       subtitle = "Before and after the gaussian convolution filter")

Show code
temps_xts <- temps %>%
  select(T_AVG, Denoised, TREND, SEAS, RESID) %>%
  xts(order.by = temps$DAY) %>%
  tail(lookback)

# Plot seasonal decomposition from Avg to residuals
grid.arrange(nrow=5, top = paste0("Classical decomposition - last ",  lookback/365, " years"), 
             
             temps_xts$T_AVG %>% 
             quickplot(subtitle = "Average Temperature", show_legend = F, xlab = NULL, ylab = "Temps"),

             temps_xts$Denoised %>% 
             quickplot(subtitle = "Denoised", show_legend = F, xlab = NULL, ylab = "Temps"),

             temps_xts$TREND %>% 
             quickplot(subtitle = "Trend", show_legend = F, xlab = NULL, ylab = "Temps"),

             temps_xts$SEAS %>% 
             quickplot(subtitle = "Seasonal", show_legend = F, xlab = NULL, ylab = "Temps"), 
             
             temps_xts$RESID %>% 
             quickplot(subtitle = "Residuals", show_legend = F, xlab = NULL, ylab = "Temps"))

Show code
# Plot original vs. fitted data
ggplot(temps, aes(x = DAY)) +
  geom_point(aes(y = T_AVG), color = 'royalblue', size = 0.5) +
  geom_line(aes(y = BAR), color = 'orange', linewidth=2) +
  labs(title = "Temperature Model Fit (all Observations)", y = "Temperature (deg C)")

3.2.1 Check for possible model degradation

Checking for model degradation across time by comparing the first and last 10 years, to see whether there is a meaningful shift between the sine curve and the average temperature.

Show code
grid.arrange(nrow = 2, ncol = 2,
ggplot(temps %>% head(lookback), aes(x = DAY)) +
  geom_point(aes(y = T_AVG), color = 'royalblue', size = 0.5) +
  geom_line(aes(y = BAR), color = 'orange', linewidth=2) +
  labs(title = paste0("Temperature Model Fit (First ", lookback/365, " years)"),x=NULL, y = NULL),

ggplot(temps %>% head(lookback), aes(x = DAY)) +
  geom_line(aes(y = RESID), color = 'black', linewidth=0.5) +
  labs(title = paste0("Residuals (First ", lookback/365, " years)"),x=NULL, y = NULL),

ggplot(temps %>% tail(lookback), aes(x = DAY)) +
  geom_point(aes(y = T_AVG), color = 'royalblue', size = 0.5) +
  geom_line(aes(y = BAR), color = 'orange', linewidth=2) +
  labs(title = paste0("Temperature Model Fit (Last ", lookback/365, " years)"), x=NULL, y = NULL),

ggplot(temps %>% tail(lookback), aes(x = DAY)) +
  geom_line(aes(y = RESID), color = 'black', linewidth=0.5) +
  labs(title = paste0("Residuals (Last ", lookback/365, " years)"),x=NULL, y = NULL)
)

3.3 Residuals analysis and diagnostics

  1. Autocorrelation Function (ACF) Plot of Residuals:

    The ACF plot checks for any correlation between residuals at different lags. If residuals are uncorrelated (i.e., resemble white noise), all values should fall within the confidence bands, indicating no significant autocorrelation.

  2. Partial Autocorrelation Function (PACF) Plot of Residuals:

    The PACF plot displays the partial correlation of residuals with their own lagged values, considering the effect of any intermediate lags. Like the ACF plot, if the model fits well, the PACF values should also fall within the confidence bands, showing no significant partial autocorrelations.

  3. Normality Check using QQ Plot:

    The QQ plot assesses whether the residuals are normally distributed. Points should lie along the reference line; significant deviations from this line suggest that the residuals do not follow a normal distribution, which could imply model misspecification or the presence of outliers.

  4. Histogram of Residuals with Normal Curve:

    The histogram visualizes the distribution of residuals, and the overlaid normal curve helps assess normality. If the histogram closely follows the bell-shaped curve, it suggests normal residuals. The vertical line representing kurtosis indicates whether the residuals are more or less peaked than a normal distribution.

  5. Residuals vs. Fitted Values Plot:

    The plot checks for patterns in the residuals against fitted values. Ideally, the residuals should be randomly scattered around zero, indicating homoscedasticity (constant variance). Any visible pattern or trend suggests issues like non-linearity, heteroscedasticity, or omitted variables.

Show code
grid.arrange(nrow = 2, 
# ACF 
ggAcf(temps$RESID, lag.max = 100)+
  labs(title = "ACF of Residuals", x = NULL, y = NULL),

# PACF
ggPacf(temps$RESID, lag.max = 100)+
  labs(title = "PACF of Residuals", x = NULL, y = NULL),

## Check normality of residuals using QQ plot
ggplot(temps, aes(sample = RESID)) +
  stat_qq(color="royalblue")+
  stat_qq_line(color = "black", linewidth = 0.4)+
  labs(title = "QQ plot", x="Theoretical Quantiles", y= "Observed Quantiles"),

## Check for heteroskedasticity or any pattern in residuals
ggplot(temps %>% tail(lookback), aes(x=BAR, y = RESID))+
  geom_point(size = 0.4)+
  geom_smooth(method = "lm")+
  labs(title = paste0("Residuals vs Fitted - last ",  lookback/365, " years"), x= "Fitted Values", y = "Residuals"))
`geom_smooth()` using formula = 'y ~ x'

Show code
# Histogram with bell curve and kurtosis
ggplot(temps, aes(x = RESID)) +
  geom_histogram(aes(y = after_stat(density)), fill = "lightblue", bins = 30) +
  stat_function(fun = dnorm, args = list(mean = mean(temps$RESID), sd = sd(temps$RESID)), 
                color = "red", linewidth = 1.2) +
  # geom_vline(xintercept = skewness(temps$RESID)[1], linetype = "dashed", linewidth=1, color = "darkred")+
  labs(title = "Histogram of Residuals with Normal Curve", x = "Residuals", y = "Density", 
       # subtitle = "Vertical line is kurtosis"
       )

4 Ornstein-Uhlenbeck (OU) process

Temperature dynamics are modeled using a mean-reverting stochastic process, the Ornstein-Uhlenbeck (OU) process, with the seasonal adjustment incorporated into the drift rate so that the expected value of the temperature follows the seasonal mean temperature.

The Ornstein-Uhlenbeck (OU) process is a type of stochastic (random) process that is often used in physics, finance, and other fields to model systems that exhibit some form of “mean-reverting” behavior. Let’s start from the basics to understand what this means.

A stochastic process is a collection of random variables representing the evolution of a system over time. In simpler terms, it describes how something changes when randomness is involved. For example, the price of a stock or the position of a particle under the influence of random forces can be modeled as a stochastic process.

\[ dX_t = \kappa (\mu - X_t) \, dt + \sigma \cdot dW_t \]

  • \(X_t\) is the value of the process at time t.
  • \(\mu\) is the long-term mean or equilibrium level toward which the process reverts.
  • \(\kappa>0\) is the rate of mean reversion, determining how fast the process reverts to the mean \(\mu\).
  • \(\sigma >0\) is the volatility or the intensity of the random fluctuations.
  • \(dW_t\) is an infinitesimal increment of a Wiener process (Brownian motion).

Mean Reversion Term \(\kappa (\mu - X_t) \, dt\): represents the tendency of the process to revert to the mean \(\mu\). If \(X_t\) is above the mean \(\mu\), this term will be negative (pulling it down toward the mean); if \(X_t\) is below \(\mu\), the term will be positive (pushing it up toward the mean).

The speed of this pull or push is determined by \(\kappa\). A higher \(\kappa\) means the process will revert to the mean faster.

Random Fluctuation Term \(\sigma \cdot dW_t\) : represents the random noise or shock. This is a source of randomness, and it causes the process to deviate randomly over time.

The size of the noise is controlled by \(\sigma\), which is the volatility parameter.
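
For the temperature model, the long-term level is not a constant \(\mu\) but the seasonal mean \(\overline{T}_t\), and its rate of change is added to the drift so the simulated path tracks the seasonal cycle. Written out (this is the form the simulation loop below discretizes with an Euler step):

\[ dT_t = \left[ \frac{d\overline{T}_t}{dt} + \kappa \left( \overline{T}_t - T_t \right) \right] dt + \sigma \cdot dW_t \]

\[ T_i = T_{i-1} + \left[ \frac{\overline{T}_i - \overline{T}_{i-1}}{\Delta t} + \kappa \left( \overline{T}_i - T_{i-1} \right) \right] \Delta t + \sigma \sqrt{\Delta t} \, \varepsilon_i, \qquad \varepsilon_i \sim N(0,1) \]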

Show code
temps_OU <- temps
# Define parameters for the OU process
kappa <- 1-arima(temps_OU$RESID, order = c(1,0,0))$coef[1]  # Mean-reversion rate
sigma <- 0.1                                                # Volatility of the process
dt <- 1                                                     # Time step (daily data)

cat("Kappa is estimated as:", round(kappa,4))
Kappa is estimated as: 0.2431
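
The link between the fitted AR(1) coefficient and the mean-reversion speed comes from discretizing the OU process: over one time step the de-seasonalized series follows \(X_{t+1} = \phi_1 X_t + \varepsilon_t\) with \(\phi_1 = e^{-\kappa \Delta t}\), so for daily data with \(\Delta t = 1\):

\[ \kappa = -\ln(\phi_1) \approx 1 - \phi_1 \]

which is the approximation used in the code above.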
Show code
# Initialize variables for simulation
n <- nrow(temps_OU)                      # Number of time points
T_simulated <- numeric(n)                # Simulated temperature vector
T_simulated[1] <- temps_OU$Denoised[1]   # Set initial temperature to the first observed value

# Simulate the seasonal mean as a time-varying mean (trend + seasonal component)
T_bar <- temps_OU$BAR

# Simulate the modified OU process
for (i in 2:n) {
  # Rate of change of the seasonal mean
  dT_bar_dt <- (T_bar[i] - T_bar[i - 1]) / dt
  
  # Brownian motion increment
  dWt <- rnorm(1, mean = 0, sd = sqrt(dt))
  
  T_simulated[i] <- T_simulated[i - 1] + 
                    (dT_bar_dt + kappa * (T_bar[i] - T_simulated[i - 1])) * dt + 
                    sigma * dWt
}

temps_OU$OU <- T_simulated

DATE <- tail(temps$DAY, lookback)

ggplotly(
  temps_OU %>%
    select(Denoised, OU, BAR) %>%
    tail(lookback) %>%
    round(4) %>%
    
    ggplot(aes(x = DATE)) +
    geom_point(aes(y = Denoised, color = "Observed"), size = 0.5) +
    geom_line(aes(y = OU, color = "Simulated"), linewidth = 0.8) +
    geom_line(aes(y = BAR, color = "Model fit"), linewidth = 0.5) +
    scale_color_manual(name = NULL, values = c(Observed = "blue", Simulated = "darkgreen", "Model fit" = "red")) +
    labs(title = "Simulated Ornstein-Uhlenbeck Process for Temperature - Last 10 years", x = "Day", y = "Temperature")
)

5 Sidequests

5.1 Sidequest: AR modeling

Fit an AR(1) model if there is significant autocorrelation

Show code
ar1_model <- arima(temps$RESID, order = c(1,0,0))

summary(ar1_model)

Call:
arima(x = temps$RESID, order = c(1, 0, 0))

Coefficients:
         ar1  intercept
      0.7569     0.0049
s.e.  0.0051     0.0659

sigma^2 estimated as 4.16:  log likelihood = -34531.51,  aic = 69069.01

Training set error measures:
                       ME     RMSE      MAE      MPE     MAPE      MASE
Training set 0.0002254017 2.039575 1.402991 35.10427 310.5978 0.9855704
                  ACF1
Training set 0.1806647
Show code
ar1_fitted <- fitted(ar1_model)
  
quickplot(ar1_fitted, title = paste0("AR(", length(ar1_model$coef), ") Model fitted"))

5.2 Sidequest: overlap of the dataset

One thing I will need later is the standard deviation across the days of the year. A pivot table is used for this step: it lines up all the years day by day, which also makes it possible to analyze seasonal patterns and look at the seasonality at a glance.

Show code
pivot_df <- DATASET %>%
  select(DOY, YEAR, T_AVG) %>%
  pivot_wider(names_from = YEAR, values_from = T_AVG)


MAX_pivot_df <- DATASET %>%
  select(DOY, YEAR, T_MAX) %>%
  pivot_wider(names_from = YEAR, values_from = T_MAX)

MIN_pivot_df <- DATASET %>%
  select(DOY, YEAR, T_MIN) %>%
  pivot_wider(names_from = YEAR, values_from = T_MIN)


MEAN <- apply(pivot_df[-1], 1, mean, na.rm=TRUE)
MEDIAN <- apply(pivot_df[-1], 1, median, na.rm=TRUE)
IQR <- apply(pivot_df[-1], 1, IQR, na.rm=TRUE)
SD <- apply(pivot_df[-1], 1, sd, na.rm=TRUE)

data.frame(MEAN = MEAN,
           MEDIAN = MEDIAN) %>% 
  quickplot(title = "Mean and median across the year", xlab = "Day of the year", ylab = "Temperature")

Show code
data.frame(IQR = SMA(IQR, n = 7),
           SD = SD) %>% 
  quickplot(title = "Standard deviation and Inter-quartile-range across the year",
            subtitle = "IQR MA(7)", xlab = "Day of the year", ylab = "Temperature")
Warning: Removed 6 rows containing missing values or values outside the scale range
(`geom_line()`).

Show code
data.frame(MAX = apply(pivot_df[-1], 1, max, na.rm=TRUE), 
           MIN = apply(pivot_df[-1], 1, min, na.rm=TRUE)) %>%
  mutate(AVG = (MAX + MIN)/2) %>% 
  mutate(RANGE = MAX - MIN) %>% 
  quickplot(title = "Range across the year", show_legend = T, xlab = "Day of the year", ylab = "Temperature")

Show code
melt(pivot_df[-1], id.vars = NULL) %>% 
ggplot(aes(x = variable, y = value)) +
  geom_boxplot(fill = "gray") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +  
  labs(title = "Boxplot for all years", x = NULL, y = NULL)
Warning: Removed 271 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Show code
DATASET %>%
  select(Month, T_AVG) %>%
  group_by(Month) %>% 
  ggplot(aes(x = Month, y = T_AVG, group = Month)) +
  geom_boxplot(fill = "gray") +
scale_x_continuous(breaks = seq(1, 12, by = 1), labels = month.abb)+
  labs(title = "Boxplot for all months across the years", x = NULL, y = NULL)

5.2.1 How is this year compared to the rest

This dataset is 43 years long, and over that period there have been some really hot stretches and some really cold ones. So I wanted to build a ribbon of all the daily maxima and minima, to see whether this year has been unusually hot relative to the averages. While this is in no way conclusive evidence, it at least gives a glimpse of how the climate is changing, especially when looking at the middle of the year.

Show code
ggplotly(
  ggplot(DATASET, aes(x = DOY, y = T_AVG, group = YEAR, color = YEAR)) +
    geom_line() +
    labs(title = "All years overlapped", x = "Day of the year", 
         y = "Index level of returns"))
Show code
THISYEAR <- data.frame(c(DATASET[DATASET$YEAR == 2024,]["T_MAX"]),
                       c(DATASET[DATASET$YEAR == 2024,]["T_AVG"]),
                       c(DATASET[DATASET$YEAR == 2024,]["T_MIN"]))

ggplotly(cbind(pivot_df,
               "MIN" = apply(pivot_df[-1], 1, min, na.rm = TRUE),
               "MAX" = apply(pivot_df[-1], 1, max, na.rm = TRUE),
               "MEAN" = apply(pivot_df[-1], 1, mean, na.rm = TRUE)) %>%
           round(2) %>% 
    ggplot(aes(x = DOY)) +
    geom_ribbon(aes(ymin = MIN, ymax = MAX), alpha = 0.2) +
    geom_line(aes(y = MEAN, color = "MEAN"), linewidth = 1) +
    geom_line(aes(y = pivot_df$'2024', color = "AVG_Current"), linewidth = 1.3) +
    geom_line(aes(y = MAX_pivot_df$'2024', color = "MAX_Current"), linewidth = 0.6, alpha = 1) +
    geom_line(aes(y = MIN_pivot_df$'2024', color = "MIN_Current"), linewidth = 0.6, alpha = 1) +
    scale_color_manual(name = "Years", values =  c(MEAN = "orange2", AVG_Current = "olivedrab", 
                                                   MIN_Current = "lightblue", MAX_Current = "indianred")) +
    labs(title = "Is this year hotter on average?",
         y = NULL,      
         x = NULL)
)

5.3 Sidequest: check accuracy

Show code
T_SEAS <- params["alpha"] * sin(omega * temps$NUM_DAY + params["theta"])
T_TREND <- params["a"] + params["b"] * temps$NUM_DAY
T_BAR <- T_TREND + T_SEAS

data.frame(
  Trend = T_TREND, 
  Seasonal = T_SEAS,
  myfit = T_BAR,
  oldfit = fitted(fit)) %>% 
  quickplot(plot_engine = "plotly")
Show code
cat("MODEL VS OU\n")
MODEL VS OU
Show code
check_acc(fitted(fit), temps_OU$OU, 10)
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.

Show code
cat("BAR VS OU\n")
BAR VS OU
Show code
check_acc(temps$BAR, temps_OU$OU, 10)
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.

5.4 B-splines

Show code
library(splines)

# Define the number of knots
knots <- c(1, 3, 5, 10, 15, 20, 30, 50, 80)

# Function to fit a spline with a given number of interior knots and plot it
create_spline_plot <- function(n_knots, x, y) {
  # Fit the spline model: with degree-3 B-splines, df = n_knots + 3 places
  # n_knots interior knots at quantiles of x
  spline_model <- lm(y ~ bs(x, df = n_knots + 3))
  yfit <- predict(spline_model, data.frame(x = x))
  
  # Calculate Residual Sum of Squares (RSS)
  rss <- sum((y - yfit)^2)

  # Create the plot
  plots <- ggplot() +
    geom_point(aes(x, y), color = 'cornflowerblue', size = 1.5) +
    geom_line(aes(x, yfit), color = 'black', linewidth = 1) +
    ggtitle(paste("Knots #", n_knots, "\nRSS:", round(rss, 2))) +
    labs(y = "Temps", x = NULL) +
    theme_classic() +
    theme(plot.title = element_text(size = 10, face = "bold"))
  
  return(plots)
}

# Generate all the plots
plots <- lapply(knots, create_spline_plot, x = 1:366, y = SD)

# Arrange and display the plots in a 3x3 grid
grid.arrange(grobs = plots, nrow = 3, ncol = 3)

Show code
beepr::beep(sound = 3)